Red Wine Dataset Exploration by Rustem Krykbaev

Univariate Plots Section

## [1] 1599   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

All variables are numeric or integer and, apart from the row index “X”, describe the chemical composition of the wines. The output variable, “quality”, is based on sensory data and is scored between 0 and 10; in this dataset it takes values from 3 to 8. The other variables are dispersed to different degrees and are recorded with different numbers of decimal places.

Quality scores range from 3 to 8, with the majority in the 5-7 range. Because the scale has so few distinct scores, it makes sense to create a categorical variable from them for further analysis.

The citric acid distribution looks multimodal, with peaks near 0, 0.25 and 0.5. It would make sense to test how this is related to other variables, including ‘quality’. There are some outliers.

pH values follow a bell-shaped distribution. Most fall into the 3.0-3.75 range.

‘volatile.acidity’ may have a bimodal distribution. It may be worth looking at how the different peaks are related to the rest of the data.

‘free.sulfur.dioxide’, the concentration of free sulfur dioxide, has a right-skewed distribution: most values sit on the left of the range and a long tail extends to the right. Transforming the x-axis does not change the shape of the distribution significantly.

The total sulfur dioxide distribution is similar to that of free sulfur dioxide. The two variables may depend on each other.

Histograms were plotted with the bin sizes adjusted for the maximal resolution allowed by the data. The variables have different distributions. “density” appears to be bell-shaped; “quality” is probably bell-shaped too, but it is hard to tell for sure with so few sensory scores. “residual.sugar”, “chlorides” and “sulphates” are roughly bell-shaped for the majority of values, with very few values in the right-hand tails. “fixed.acidity”, “total.sulfur.dioxide”, “alcohol” and “free.sulfur.dioxide” are right-skewed, with most values on the left of the range and a tail to the right. “volatile.acidity” and “pH” could be bimodal. “citric.acid” looks the most interesting and has no definitive shape. There are spikes at 0 and 0.50, which may reflect specifics of the manufacturing process of those varieties that are unavailable to us for analysis. It would be interesting to find out what accounts for that particular shape of the citric acid distribution and how it correlates with the “quality” values.
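The direction of skew can be cross-checked without the plots by comparing each mean with its median in the summary output above: a mean clearly above the median suggests a longer right tail. The sketch below is in Python rather than the R used for the analysis, and the 1% threshold is an arbitrary illustrative choice, not a statistical test.

```python
# Mean and median per variable, copied from the summary() output above.
summary_stats = {
    "fixed.acidity": (8.32, 7.90),
    "free.sulfur.dioxide": (15.87, 14.00),
    "total.sulfur.dioxide": (46.47, 38.00),
    "alcohol": (10.42, 10.20),
    "density": (0.9967, 0.9968),
    "pH": (3.311, 3.310),
}


def skew_direction(mean, median, tolerance=0.01):
    """Rough skew direction from the relative gap between mean and median.

    `tolerance` is a hypothetical cut-off chosen only for illustration.
    """
    relative_gap = (mean - median) / median
    if relative_gap > tolerance:
        return "right-skewed"
    if relative_gap < -tolerance:
        return "left-skewed"
    return "roughly symmetric"


directions = {
    name: skew_direction(mean, median)
    for name, (mean, median) in summary_stats.items()
}
for name, direction in directions.items():
    print(f"{name}: {direction}")
```

This heuristic only looks at two summary numbers, so it cannot detect bimodality or spikes such as the one at zero for citric acid.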

Univariate Analysis

What is the structure of your dataset?

There are 1599 wines and 12 features (plus the row index “X”): “fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, “alcohol” and “quality”. All variables are numeric or integer types.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is a sensory assessment of wine quality (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). In this dataset all values fall into the interval between 3 and 8, and most are in the 5-7 range.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

All other features are concentrations or properties of various chemicals which, alone or in combination, can influence taste, so probably all of them are important. Some of them may be related to each other, e.g. “free.sulfur.dioxide” and “total.sulfur.dioxide”, or “pH” and the amounts of acids.

Did you create any new variables from existing variables in the dataset?

I created a variable called “quality.category”, which is a categorical representation of the “quality” variable. I believe it makes sense because “quality” is a sensory assessment and there are only a few steps on the scale.
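The report does not show how “quality.category” was built. As a minimal sketch of one possible encoding, the Python snippet below (not the R used for the analysis) groups the first ten quality scores from the str() output into three bands. The band names and cut points are hypothetical and chosen only for illustration.

```python
from collections import Counter

# First ten quality scores, copied from the str() output above.
quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]


def quality_category(score):
    """Map an integer score to a band; cut points are an assumption."""
    if score <= 4:
        return "low"
    if score <= 6:
        return "medium"
    return "high"


categories = [quality_category(score) for score in quality]

# Count how many of the ten wines fall into each band.
category_counts = Counter(categories)
print(category_counts)
```

Keeping each score as its own level instead of banding would preserve more detail, at the cost of very small groups for scores 3, 4 and 8.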

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

There were some right-skewed distributions, such as “free.sulfur.dioxide”, “total.sulfur.dioxide” and “alcohol”. “volatile.acidity” may be bimodal. “citric.acid” does not have a definitive shape. There was no need to perform operations to adjust the form of the data.

Bivariate Plots Section

The boxplots reveal some noticeable trends. Some features seem to correlate with the quality score. Higher-rated wines have higher alcohol and citric acid levels and lower pH, density and “volatile.acidity” (acetic acid). Lower pH could be associated with higher alcohol levels, as a result of the fermentation process, and also with higher citric acid content. It would be interesting to see which of those features is the most important for the sensory score. The rest of the features, such as “chlorides”, “free.sulfur.dioxide” and “total.sulfur.dioxide”, may not correlate with “quality”. Interestingly, a relatively large number of outliers seems to be associated with quality scores 5, 6 and 7 for a number of features, such as “chlorides” and “sulphates”. It looks like there is more variability among mid-range quality wines in non-essential features.
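For reference, boxplots conventionally flag as outliers the points lying more than 1.5 times the interquartile range beyond the quartiles. The Python sketch below (not the R used for the analysis) applies that rule to the first ten “residual.sugar” values from the str() output. It assumes the plots used this default rule; the quartile interpolation method is also an assumption.

```python
import statistics

# First ten residual.sugar values, copied from the str() output above.
residual_sugar = [1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1]


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    # "inclusive" interpolates between the observed minimum and maximum.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


outliers = iqr_outliers(residual_sugar)
print(outliers)
```

Because the fences depend only on the quartiles, a group with many more observations will usually show more flagged points even if its spread is the same.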

## [1] 0.831146
## [1] 0.6516203

The means and medians of ‘fixed.acidity’, stratified by quality, correlate with the quality scores (the two coefficients above are r = 0.83 and r = 0.65), which confirms the trend observed in the boxplots.

## [1] 0.9831221
## [1] 0.9724308

The means and medians of ‘citric.acid’, stratified by quality, also correlate strongly with the quality scores (r = 0.98 and r = 0.97 above), which confirms the trend observed in the boxplots.
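The stratified correlation described above can be sketched as: group the feature by quality score, take the median of each group, then correlate the medians with the scores. The Python snippet below (not the R used for the analysis) uses only the first ten rows from the str() output, so it illustrates the steps and will not reproduce the coefficients reported here. Note that a correlation computed over a handful of group medians ignores the within-group scatter and is usually much higher than the row-level correlation.

```python
import math
import statistics
from collections import defaultdict

# First ten rows, copied from the str() output above.
quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]
citric_acid = [0.0, 0.0, 0.04, 0.56, 0.0, 0.0, 0.06, 0.0, 0.02, 0.36]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Collect the citric acid values observed for each quality score.
groups = defaultdict(list)
for score, value in zip(quality, citric_acid):
    groups[score].append(value)

scores = sorted(groups)
medians = [statistics.median(groups[score]) for score in scores]

# Correlate one median per quality score with the score itself.
r_medians = pearson(scores, medians)
print(scores, medians, r_medians)
```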

No conclusive difference between two peaks

Scatter plots of quality score against features such as ‘alcohol’, ‘citric.acid’ or ‘volatile.acidity’ show that far more observations fall into scores 5, 6 and 7, while very few fall into scores 3, 4 and 8, because most wines are of mid-range quality. This probably explains, at least in part, the observation in the boxplot section above that mid-range wines have more outliers: there are simply many more values in those categories. I wonder whether the trend would persist if each quality category held a comparable number of wines.

Histograms faceted by quality score provide essentially the same information as the boxplots and scatter plots.

Some of the features are related and some not.

## [1] -0.2561309

Not much correlation here. This is understandable: fixed.acidity measures tartaric acid and volatile.acidity measures acetic acid.

## [1] -0.6829782
## [1] -0.7063602

This is a rather strong correlation, and it is even stronger with a log10 transformation. It is somewhat expected, since pH is the measure of acidity in the solution. The lower the pH (the higher the acidity), the higher the concentration of tartaric acid, which explains the negative correlation.
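Since pH is itself a logarithmic measure of hydrogen ion concentration, comparing it with the logarithm of an acid concentration is a natural thing to try. The Python sketch below (not the R used for the analysis) computes the correlation with and without the transformation on the first ten rows from the str() output. Applying log10 to “fixed.acidity” rather than to pH is an assumption, and ten rows will not reproduce the coefficients above.

```python
import math
import statistics

# First ten rows, copied from the str() output above.
fixed_acidity = [7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5]
ph = [3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Correlation on the raw scale.
r_raw = pearson(fixed_acidity, ph)

# Correlation after a log10 transformation of the acid concentration.
log_fixed_acidity = [math.log10(value) for value in fixed_acidity]
r_log = pearson(log_fixed_acidity, ph)

print(f"raw: {r_raw:.3f}, log10: {r_log:.3f}")
```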

## [1] 0.2349373

There is some correlation between pH and acetic acid, but not a very strong one. For some reason the correlation is positive. Other factors may be involved.

## [1] 0.2056325

Weak correlation between pH and alcohol. The reason could be that most CO2 is being eliminated during fermentation.

## [1] 0.1099032

No relation between alcohol and citric acid.

## [1] -0.08565242

No correlation here.

## [1] 0.04207544

I would think that there might be a relation between the sugar that remains after fermentation and the alcohol, because sugar is a substrate for fermentation, but apparently there is none here.

## [1] 0.6676665

Strong correlation here which is expected.

## [1] 0.3552834

Moderate correlation, which is expected.

## [1] -0.5419041

Substantial negative correlation, which is expected.

## [1] 0.6717034

Interestingly, there is quite a strong correlation between tartaric acid and citric acid concentrations.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The boxplots reveal some noticeable trends. Some features seem to correlate with the quality score, which is the feature of interest in this dataset. Higher-rated wines have higher alcohol, tartaric acid and citric acid levels and lower pH, density and “volatile.acidity” (acetic acid). For instance, r = 0.83 for the correlation between the medians of fixed.acidity in each quality category and the quality score, and the correlation is even stronger (r = 0.98) for the citric.acid medians. It would be interesting to see which of those features is the most important for the sensory score. The rest of the features, such as “chlorides”, “free.sulfur.dioxide” and “total.sulfur.dioxide”, may not correlate with “quality”.

A relatively large number of outliers seems to be associated with quality scores 5, 6 and 7 for a number of features, such as “chlorides” and “sulphates”. The scatter plots show the same thing. Scatter plots of quality score against features such as ‘alcohol’, ‘citric.acid’ or ‘volatile.acidity’ show that far more observations fall into scores 5, 6 and 7, while very few fall into scores 3, 4 and 8, because most wines are of mid-range quality. This probably explains, at least in part, the observation in the boxplot section above that mid-range wines have more outliers: there are simply many more values in those categories. I wonder whether the trend would persist if each quality category held a comparable number of wines.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I would think that there might be a relation between the sugar that remains after fermentation and the alcohol level, because sugar is a substrate for fermentation. For instance, if fermentation starts with the same initial amount of sugar in each wine, then more alcohol would mean less remaining sugar, but apparently there is no relation here. It could be that the initial sugar amounts differ greatly between wines before fermentation, or that the wines are blends of different varieties combined after fermentation ended.

There is some correlation between pH and acetic acid, but not a very strong one. For some reason the correlation is positive, where one would expect a negative one. Lower pH should correlate with higher amounts of acid, which is the case here for tartaric acid but not for acetic acid. Apparently the relationship is more complex in nature, and other factors may be involved.

There is a strong relationship between tartaric acid and pH. Apparently tartaric acid has a significant influence on pH.

Interestingly, there is quite a strong relationship between tartaric acid and citric acid concentrations. Tartaric acid (fixed.acidity) and citric acid concentrations are distributed quite differently, but are strongly correlated with each other.

What was the strongest relationship you found?

There is a rather strong correlation between pH and tartaric acid concentration (r = -0.68), which is even stronger with a log10 transformation (r = -0.71). It is somewhat expected, since pH is the measure of acidity in the solution. The lower the pH (the higher the acidity), the higher the concentration of tartaric acid, which explains the negative correlation. Interestingly, there is also quite a strong correlation (r = 0.67) between tartaric acid and citric acid concentrations. Another strong relationship is the one between free and total sulfur dioxide concentrations (r = 0.67), which is expected too.

Multivariate Plots Section

Most higher quality wines are associated with a higher citric acid level and lower pH, in the upper left corner.

Most higher quality wines are associated with a higher tartaric acid level and lower pH, in the upper left corner.

The same conclusion as from the scatter plot above can be drawn from the pH/fixed.acidity ratio density plot.

Most higher quality wines are associated with higher tartaric and citric acid levels, in the upper right corner.

Higher quality is associated with higher alcohol and lower pH.

The same conclusion as from the scatter plot above can be drawn from the alcohol/pH ratio density plot.

Wines of average quality (5-6) seem to sit mostly in the middle of the density and pH ranges, while higher quality wines are spread throughout the range from high density/lower pH to low density/high pH.

According to the density plot, there may be some tendency for quality wines to sit in the lower pH/density ratio area, which corresponds to the higher density/lower pH region of the scatter plot above. This was not obvious from the boxplots of ‘pH’ and ‘density’ vs ‘quality’ or from the scatter plot of ‘pH’ over ‘density’.
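The ratio features behind the density plots in this section (pH/fixed.acidity, alcohol/pH, pH/density) are simple element-wise quotients of two columns. The Python sketch below (not the R used for the analysis) derives alcohol/pH for the first ten rows from the str() output and averages it per quality score. Summarising by the mean is an assumption made for illustration; the plots compare whole distributions.

```python
import statistics
from collections import defaultdict

# First ten rows, copied from the str() output above.
quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]
alcohol = [9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5]
ph = [3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35]

# Derived feature: alcohol divided by pH, one value per wine.
alcohol_ph_ratio = [a / p for a, p in zip(alcohol, ph)]

# Group the ratio by quality score and summarise each group.
ratio_by_quality = defaultdict(list)
for score, ratio in zip(quality, alcohol_ph_ratio):
    ratio_by_quality[score].append(ratio)

mean_ratio = {
    score: statistics.fmean(values)
    for score, values in sorted(ratio_by_quality.items())
}
print(mean_ratio)
```

A ratio collapses two dimensions into one, so wines with different alcohol and pH values can share the same ratio; the scatter plots remain the more detailed view.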

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Most higher quality wines were associated with a higher citric acid level and lower pH. The same was true for the relationship between tartaric acid and pH. Also, higher quality was related to higher tartaric and citric acid levels. In addition, higher quality was associated with higher alcohol and lower pH levels.

Were there any interesting or surprising interactions between features?

According to the scatter plot, wines of average quality (5-6) seem to sit mostly in the middle of the density and pH ranges, while higher quality wines are spread throughout the range from high density/lower pH to low density/high pH. The density plot reveals that there may be some tendency for quality wines to sit in the lower pH/density ratio area, which corresponds to the higher density/lower pH region of the scatter plot. This was not obvious from the boxplots of ‘pH’ and ‘density’ vs ‘quality’ or from the scatter plot of ‘pH’ over ‘density’.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

Distributions of two acids are different. Means are indicated by red dashed lines and medians are indicated by blue dashed lines.

## [1] 0.2709756
## [1] 0.26

Mean and median are close to each other for citric acid.

## [1] 8.319637
## [1] 7.9

For tartaric acid the mean and median are separated, which indicates skewness of the distribution. The distribution of tartaric acid is bell-shaped with some skewness, while the distribution of citric acid is irregular in shape. A substantial number of wines contain no citric acid at all.

Plot Two

Description Two

## [1] 0.9831221

Medians of citric acid concentrations in each sensory score category are correlated with the quality scores (r=0.98). The higher the median of the citric acid, the higher the sensory score.

## Source: local data frame [6 x 2]
## 
##   quality     n
##     (int) (int)
## 1       3    10
## 2       4    53
## 3       5   681
## 4       6   638
## 5       7   199
## 6       8    18

Many more outliers are located in scores 5, 6 and 7 because many more values in general fall into these categories. Most wines are of mid-range quality.
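The imbalance can be quantified directly from the counts in the table above. The Python sketch below (not the R used for the analysis) computes the share of wines at each score and the combined share for scores 5 to 7, which comes to roughly 95% of the 1599 wines.

```python
# Number of wines per quality score, copied from the table above.
counts = {3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18}

total = sum(counts.values())

# Share of the dataset at each individual score.
shares = {score: n / total for score, n in counts.items()}

# Combined share of the mid-range scores 5, 6 and 7.
mid_range_share = sum(counts[score] for score in (5, 6, 7)) / total

for score, share in shares.items():
    print(f"quality {score}: {share:.1%}")
print(f"scores 5-7 combined: {mid_range_share:.1%}")
```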

Plot Three

Description Three

## [1] 0.6717034

Tartaric acid concentration is related to citric acid concentration (r=0.67).

## [1] 0.8961178

The correlation between the medians is even stronger (r=0.90).

Lower tartaric acid corresponds to lower citric acid and higher tartaric acid corresponds to higher citric acid. Higher sensory scores are mostly present in the upper right quadrant, which indicates an association of “quality” with higher concentrations of both acids. Those two variables can be used for model building.


Reflection

The red wine dataset has 1599 wine varieties from Portugal. The variables represent the chemical composition of the wines. The purpose of the dataset could be to build a predictive model of the perceived quality of a wine based on its chemical composition. All variables have numeric or integer values. The output variable, quality, is based on sensory data and is scored between 0 and 10; in this dataset it takes values from 3 to 8. The other variables are dispersed to different degrees and are recorded with different numbers of decimal places.

I realized that the output variable can be represented both numerically and categorically. Both kinds can be used for the analysis, so an extra column with a categorical quality score was added to the dataset. The analysis reveals that some variables are associated with the quality score, such as tartaric and citric acid concentrations, pH, alcohol and density. Other features, such as chloride, residual sugar and sulphate levels, do not seem to be associated with quality.

Most of the wines are of mid-range quality, so low and high quality wines are underrepresented in the dataset. It would certainly improve the power of the analysis if more wines were added to the low and high quality categories. The chemical composition of wine is very complex, and many more features could be added, such as various inorganic and organic components, which could improve the analysis and increase the predictive power of a prospective model. To compensate in part for the lack of a comprehensive chemical characterization, other features could be included, such as the region of the winery or the age of the wine.